# High-precision speech recognition
## Parakeet RNNT 1.1B
Parakeet RNNT 1.1B is an automatic speech recognition (ASR) model jointly developed by NVIDIA NeMo and Suno.ai. Built on the FastConformer Transducer architecture with approximately 1.1 billion parameters, it supports English speech transcription.
Tags: Speech Recognition, English · Publisher: nvidia · Downloads: 13.18k · Likes: 124
## STT En FastConformer Transducer XLarge
The NVIDIA FastConformer-Transducer XLarge is a high-performance model for English automatic speech recognition (ASR), using an optimized FastConformer architecture and a Transducer decoder with approximately 618 million parameters.
Tags: Speech Recognition, English · Publisher: nvidia · Downloads: 106 · Likes: 24
## STT En FastConformer CTC XLarge
NVIDIA FastConformer-CTC XLarge is an automatic speech recognition (ASR) model with approximately 600 million parameters, designed specifically for English speech transcription and trained with the FastConformer architecture and CTC loss.
Tags: Speech Recognition, English · Publisher: nvidia · Downloads: 216 · Likes: 2
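Several of the models in this list are trained with CTC loss. As a rough illustration of how a CTC model's frame-level output becomes text, here is a minimal greedy-decoding sketch: collapse repeated symbols, then drop the blank token. The vocabulary, blank ID, and frame sequence below are made up for illustration.

```python
BLANK = 0  # assumption: the blank token has ID 0, as in most CTC vocabularies

def ctc_greedy_decode(frame_ids, id_to_char):
    """Collapse repeats, then remove blanks (CTC greedy decoding)."""
    out = []
    prev = None
    for i in frame_ids:
        if i != prev and i != BLANK:
            out.append(id_to_char[i])
        prev = i
    return "".join(out)

# Toy vocabulary and per-frame argmax IDs (hypothetical)
vocab = {1: "c", 2: "a", 3: "t"}
frames = [0, 1, 1, 0, 2, 2, 2, 0, 3, 0]
print(ctc_greedy_decode(frames, vocab))  # -> cat
```

Note that a blank between two identical symbols keeps them distinct (`[2, 0, 2]` decodes to `"aa"`, while `[2, 2]` decodes to `"a"`), which is how CTC represents doubled letters.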
## STT En FastConformer CTC Large
A large automatic speech recognition (ASR) model based on the FastConformer architecture, designed specifically for transcribing English speech into text.
Tags: Speech Recognition, English · Publisher: nvidia · Downloads: 1,001 · Likes: 12
## STT En FastConformer Transducer Large
A large automatic speech recognition (ASR) model based on the FastConformer architecture with a Transducer decoder, designed specifically for transcribing English speech into text.
Tags: Speech Recognition, English · Publisher: nvidia · Downloads: 1,398 · Likes: 7
## Whisper Large V2 Japanese 5k Steps
A speech recognition model based on OpenAI's whisper-large-v2, fine-tuned on the Japanese CommonVoice dataset for 5,000 steps, reaching a word error rate (WER) of 0.7449.
License: Apache-2.0 · Tags: Speech Recognition, Transformers, Japanese · Publisher: clu-ling · Downloads: 144 · Likes: 20
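Word error rate (WER), quoted for several models in this list, is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal self-contained sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the bat sat"))  # 1 substitution / 3 words
```

Because every insertion counts as an error, WER can exceed 1.0, which is why it is often reported as a raw ratio rather than a percentage.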
## STT En Conformer Transducer XLarge
An automatic speech recognition (ASR) model developed by NVIDIA, based on the Conformer-Transducer architecture with approximately 600 million parameters, designed specifically for English speech transcription.
Tags: Speech Recognition, English · Publisher: nvidia · Downloads: 496 · Likes: 54
## ASR Wav2Vec2 LibriSpeech
An end-to-end automatic speech recognition system trained on the LibriSpeech dataset, combining a pre-trained wav2vec 2.0 model with CTC decoding; it performs strongly on English speech recognition tasks.
License: Apache-2.0 · Tags: Speech Recognition, English · Publisher: speechbrain · Downloads: 1,667 · Likes: 9
## Wav2Vec2 Large 960h LV60 Self With Wikipedia LM
An automatic speech recognition (ASR) system based on Facebook's wav2vec2-large-960h-lv60-self model, improved with an enhanced Wikipedia-based language model.
Tags: Speech Recognition, Transformers · Publisher: gxbag · Downloads: 15 · Likes: 2
## Wav2Vec2 Conformer RoPE Large 100h FT
A Wav2Vec2-Conformer model incorporating rotary position embeddings, fine-tuned on 100 hours of LibriSpeech data.
License: Apache-2.0 · Tags: Speech Recognition, Transformers · Publisher: facebook · Downloads: 99 · Likes: 0
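The rotary position embedding (RoPE) used by the Wav2Vec2-Conformer RoPE variants rotates each pair of feature dimensions by an angle proportional to the position, so that dot products between rotated queries and keys depend only on their relative offset. A minimal sketch in plain Python (vectors and positions below are toy values):

```python
import math

def rope(vec, pos, base=10000.0):
    """Apply a rotary position embedding to a vector at a given position.

    Dimension pair (2k, 2k+1) is rotated by pos * base**(-2k / dim),
    following the standard RoPE frequency schedule.
    """
    out = list(vec)
    dim = len(vec)
    for k in range(dim // 2):
        theta = pos * base ** (-2 * k / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * k], vec[2 * k + 1]
        out[2 * k] = x * c - y * s
        out[2 * k + 1] = x * s + y * c
    return out

q = [1.0, 0.0, 0.5, 0.5]
q3 = rope(q, pos=3)
# Rotations preserve the vector norm
print(sum(v * v for v in q), sum(v * v for v in q3))
```

The relative-position property can be checked directly: the dot product of `rope(q, m)` with `rope(k, n)` is unchanged when both positions are shifted by the same offset.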
## Wav2Vec2 Conformer RoPE Large 960h FT
A Wav2Vec2-Conformer model incorporating rotary position embeddings, pre-trained and fine-tuned on 960 hours of 16kHz-sampled LibriSpeech data; suitable for English speech recognition tasks.
License: Apache-2.0 · Tags: Speech Recognition, Transformers, English · Publisher: facebook · Downloads: 22.02k · Likes: 10
## Wav2Vec2 Conformer Rel-Pos Large 960h FT
A Wav2Vec2-Conformer model for 16kHz-sampled speech audio, using relative position embeddings, pre-trained and fine-tuned on 960 hours of LibriSpeech data.
License: Apache-2.0 · Tags: Speech Recognition, Transformers, English · Publisher: facebook · Downloads: 1,038 · Likes: 5
## Wav2Vec2 Large 960h LV60 Self 4-Gram
Based on Facebook's wav2vec2-large-960h-lv60-self model, extended with an English 4-gram language model to improve speech recognition accuracy.
License: Apache-2.0 · Tags: Speech Recognition, English · Publisher: patrickvonplaten · Downloads: 22 · Likes: 4
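The idea behind pairing an acoustic model with a 4-gram language model, as in the two models above, is to re-rank candidate transcripts by acoustic score plus a weighted LM log-probability. A toy sketch with an add-one-smoothed maximum-likelihood n-gram (the corpus, candidates, acoustic scores, and LM weight below are all made up for illustration; production systems use smoothed LMs such as KenLM):

```python
import math
from collections import Counter

def train_ngram(corpus, n=4):
    """Count n-grams and their (n-1)-word contexts over a toy corpus."""
    grams, contexts = Counter(), Counter()
    for sent in corpus:
        words = ["<s>"] * (n - 1) + sent.split() + ["</s>"]
        for i in range(n - 1, len(words)):
            grams[tuple(words[i - n + 1 : i + 1])] += 1
            contexts[tuple(words[i - n + 1 : i])] += 1
    return grams, contexts

def lm_logprob(sentence, grams, contexts, n=4, vocab_size=1000):
    """Add-one-smoothed n-gram log-probability of a sentence."""
    words = ["<s>"] * (n - 1) + sentence.split() + ["</s>"]
    lp = 0.0
    for i in range(n - 1, len(words)):
        g = tuple(words[i - n + 1 : i + 1])
        c = tuple(words[i - n + 1 : i])
        # add-one smoothing gives unseen n-grams a nonzero probability
        lp += math.log((grams[g] + 1) / (contexts[c] + vocab_size))
    return lp

grams, contexts = train_ngram(["the cat sat on the mat"] * 5)
# (acoustic_score, transcript) pairs; the scores are hypothetical
candidates = [(-1.2, "the cat sat on the mat"), (-1.1, "the cat sad on the mat")]
lm_weight = 0.5
best = max(candidates,
           key=lambda c: c[0] + lm_weight * lm_logprob(c[1], grams, contexts))
print(best[1])  # LM evidence flips the ranking toward the fluent sentence
```

Even though the second candidate has the higher acoustic score, the LM term favors the word sequence it has seen, which is exactly how an external 4-gram model corrects acoustically plausible but unlikely words.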
## Wav2Vec2 Base 960h 4-Gram
Based on Facebook's wav2vec2-base-960h model, with an added English 4-gram language model to improve automatic speech recognition (ASR) accuracy.
License: Apache-2.0 · Tags: Speech Recognition, Transformers, English · Publisher: patrickvonplaten · Downloads: 19 · Likes: 0
## STT En Conformer CTC Large
A large automatic speech recognition (ASR) model based on the Conformer architecture, trained with the CTC loss function for English speech transcription.
Tags: Speech Recognition, English · Publisher: nvidia · Downloads: 3,740 · Likes: 24
## Data2Vec Audio Large 960h
Data2Vec is a general self-supervised learning framework applicable to speech, vision, and language tasks. This large audio model is pre-trained and fine-tuned on 960 hours of LibriSpeech data and optimized for automatic speech recognition.
License: Apache-2.0 · Tags: Speech Recognition, Transformers, English · Publisher: facebook · Downloads: 2,531 · Likes: 7
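Data2Vec-style self-supervised learning trains a student network to predict the representations of a teacher whose weights are an exponential moving average (EMA) of the student's. The update itself is simple; a minimal sketch with toy weight vectors (the values and the fixed student are illustrative only):

```python
def ema_update(teacher, student, tau=0.999):
    """One EMA step: teacher <- tau * teacher + (1 - tau) * student."""
    return [tau * t + (1 - tau) * s for t, s in zip(teacher, student)]

teacher = [0.0, 0.0]
student = [1.0, -1.0]
for _ in range(1000):  # student held fixed here for illustration
    teacher = ema_update(teacher, student)
print(teacher)  # the teacher drifts slowly toward the student's weights
```

After 1000 steps with tau = 0.999 the teacher has moved a fraction 1 - 0.999^1000 (about 0.63) of the way to the student, which is why the teacher provides stable, slowly evolving prediction targets during training.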
## IWSLT ASR Wav2Vec Large 4500h
A large English automatic speech recognition model based on the Wav2Vec2 architecture, fine-tuned on 4,500 hours of multi-source speech data and supporting decoding with a language model.
Tags: Speech Recognition, Transformers, English · Publisher: nguyenvulebinh · Downloads: 27 · Likes: 2
## Simpleoier Librispeech Asr Train Asr Conformer7 Wavlm Large Raw En Bpe5000 Sp
An automatic speech recognition (ASR) model trained with the ESPnet framework, using a Conformer architecture on top of the WavLM large pre-trained model and trained on the LibriSpeech dataset.
Tags: Speech Recognition, English · Publisher: espnet · Downloads: 66 · Likes: 1
## WavLM Libri Clean 100h Large
An automatic speech recognition model based on microsoft/wavlm-large, fine-tuned on the clean 100-hour subset of the LibriSpeech ASR dataset.
Tags: Speech Recognition, Transformers · Publisher: patrickvonplaten · Downloads: 8,171 · Likes: 3
## Personal Speech To Text Model
A personal speech-to-text model fine-tuned from facebook/wav2vec2-large-robust-ft-swbd-300h, optimized for specific accents.
Tags: Speech Recognition, Transformers · Publisher: fractalego · Downloads: 75 · Likes: 6
## Wav2Vec2 Large 960h LV60
Wav2Vec2 is a speech recognition model that learns features from raw audio through self-supervised learning, achieving strong recognition performance with limited labeled data.
License: Apache-2.0 · Tags: Speech Recognition, English · Publisher: facebook · Downloads: 7,011 · Likes: 6
## WavLM Libri Clean 100h Base
An automatic speech recognition model based on microsoft/wavlm-base, fine-tuned on the clean 100-hour subset of the LibriSpeech ASR dataset.
Tags: Speech Recognition, Transformers · Publisher: patrickvonplaten · Downloads: 6,515 · Likes: 1
## HuBERT Large LS960 FT
HuBERT-Large is a self-supervised speech representation learning model, fine-tuned on 960 hours of LibriSpeech data for automatic speech recognition.
License: Apache-2.0 · Tags: Speech Recognition, Transformers, English · Publisher: facebook · Downloads: 776.27k · Likes: 66
## WavLM Libri Clean 100h Base Plus
An automatic speech recognition model based on microsoft/wavlm-base-plus, fine-tuned on the clean 100-hour subset of the LibriSpeech ASR dataset.
Tags: Speech Recognition, Transformers · Publisher: patrickvonplaten · Downloads: 126.17k · Likes: 3
## Wav2Vec2 Base 960h
The Wav2Vec2 base model developed by Facebook, pre-trained and fine-tuned on 960 hours of LibriSpeech audio for English automatic speech recognition.
License: Apache-2.0 · Tags: Speech Recognition, Transformers, English · Publisher: facebook · Downloads: 2.1M · Likes: 331
## Wav2Vec2 Large 960h LV60 Self
The Wav2Vec2 large model developed by Facebook, pre-trained and fine-tuned on 960 hours of Libri-Light and LibriSpeech audio with a self-training objective, achieving state-of-the-art results on the LibriSpeech test sets at release.
License: Apache-2.0 · Tags: Speech Recognition, English · Publisher: facebook · Downloads: 56.00k · Likes: 146
## Wav2Vec2 Base 960h
Wav2Vec2 is a self-supervised speech recognition model developed by Facebook, trained on the LibriSpeech dataset and supporting English speech-to-text.
License: Apache-2.0 · Tags: Speech Recognition, Transformers, English · Publisher: tommy19970714 · Downloads: 19 · Likes: 0
## HuBERT XLarge LS960 FT
A fine-tuned HuBERT extra-large speech recognition model trained on 960 hours of LibriSpeech data, achieving a WER of 1.8 on the LibriSpeech test set.
License: Apache-2.0 · Tags: Speech Recognition, Transformers, English · Publisher: facebook · Downloads: 8,160 · Likes: 14
## Data2Vec Audio Base 960h
Data2Vec is a general self-supervised learning framework applicable to speech, vision, and language. This base audio model is pre-trained and fine-tuned on 960 hours of LibriSpeech audio for speech recognition.
License: Apache-2.0 · Tags: Speech Recognition, Transformers, English · Publisher: facebook · Downloads: 10.61k · Likes: 12